Robotics 42
☆ DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning
We present DSDrive, a streamlined end-to-end paradigm tailored for
integrating the reasoning and planning of autonomous vehicles into a unified
framework. DSDrive leverages a compact LLM distilled from a larger
vision-language model (VLM) so as to preserve the larger model's enhanced
reasoning capabilities. To effectively align the reasoning and planning tasks, a
waypoint-driven dual-head coordination module is further developed, which
synchronizes dataset structures, optimization objectives, and the learning
process. By integrating these tasks into a unified framework, DSDrive anchors
on the planning results while incorporating detailed reasoning insights,
thereby enhancing the interpretability and reliability of the end-to-end
pipeline. DSDrive has been thoroughly tested in closed-loop simulations, where
it performs on par with benchmark models and even outperforms them on many key
metrics, all while being more compact in size. In addition, DSDrive's
computational efficiency, as reflected in its inference time and memory
requirements, is significantly improved. This work thus highlights the
potential of lightweight systems to deliver interpretable and efficient
solutions for AD.
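The distillation mechanism the abstract relies on is typically a softened cross-entropy between teacher and student logits. The sketch below illustrates that generic loss under assumed logits and temperature; DSDrive's actual objective may differ.

```python
import numpy as np

# Generic knowledge-distillation loss: cross-entropy between
# temperature-softened teacher and student distributions.
# Logits and temperature are illustrative assumptions.

def softmax(x, t=1.0):
    z = np.exp((x - np.max(x)) / t)
    return z / z.sum()

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum())

teacher = np.array([4.0, 1.0, 0.5])
aligned = distill_loss(np.array([4.0, 1.0, 0.5]), teacher)   # student matches
mismatch = distill_loss(np.array([0.5, 1.0, 4.0]), teacher)  # student reversed
```

A matched student incurs a lower loss than a mismatched one, which is the gradient signal that transfers the teacher's behavior.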
☆ Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects
The rapid adoption of Vision Language Models (VLMs), pre-trained on large
image-text and video-text datasets, calls for protecting and informing users
about when to trust these systems. This survey reviews studies on trust
dynamics in user-VLM interactions, through a multi-disciplinary taxonomy
encompassing different cognitive science capabilities, collaboration modes, and
agent behaviours. Literature insights and findings from a workshop with
prospective VLM users inform preliminary requirements for future VLM trust
studies.
☆ CottonSim: Development of an autonomous visual-guided robotic cotton-picking system in the Gazebo
Thevathayarajh Thayananthan, Xin Zhang, Yanbo Huang, Jingdao Chen, Nuwan K. Wijewardane, Vitor S. Martins, Gary D. Chesser, Christopher T. Goodin
In this study, an autonomous visual-guided robotic cotton-picking system,
built on a Clearpath Husky robot platform and the Cotton-Eye perception
system, was developed in the Gazebo robotic simulator. Furthermore, a virtual
cotton farm was designed and developed as a Robot Operating System (ROS 1)
package to deploy the robotic cotton picker in the Gazebo environment for
simulating autonomous field navigation. The navigation was assisted by the map
coordinates and an RGB-depth camera, while the ROS navigation algorithm
utilized a trained YOLOv8n-seg model for instance segmentation. The model
achieved a desired mean Average Precision (mAP) of 85.2%, a recall of 88.9%,
and a precision of 93.0% for scene segmentation. The developed ROS navigation
packages enabled our robotic cotton-picking system to autonomously navigate
through the cotton field using map-based and GPS-based approaches, visually
aided by a deep learning-based perception system. The GPS-based navigation
approach achieved a 100% completion rate (CR) with a threshold of 5 x 10^-6
degrees, while the map-based navigation approach attained a 96.7% CR with a
threshold of 0.25 m. This study establishes a fundamental baseline of
simulation for future agricultural robotics and autonomous vehicles in cotton
farming and beyond. CottonSim code and data are released to the research
community via GitHub: https://github.com/imtheva/CottonSim
comment: 45 pages, 15 figures, 4 tables
☆ Localization and path following for an autonomous e-scooter
David Meister, Robin Strässer, Felix Brändle, Marc Seidel, Benno Bassler, Nathan Gerber, Jan Kautz, Elena Rommel, Frank Allgöwer
In order to mitigate economic, ecological, and societal challenges in
electric scooter (e-scooter) sharing systems, we develop an autonomous
e-scooter prototype. Our vision is to design a fully autonomous prototype that
can find its way to the next parking spot, high-demand area, or charging
station. In this work, we propose a path following solution to enable
localization and navigation in an urban environment with a provided path to
follow. We design a closed-loop architecture that solves the localization and
path following problem while allowing the e-scooter to maintain its balance
with a previously developed reaction wheel mechanism. Our approach facilitates
state and input constraints, e.g., adhering to the path width, while remaining
executable on a Raspberry Pi 5. We demonstrate the efficacy of our approach in
a real-world experiment on our prototype.
☆ PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando
We introduce the novel task of Language-Guided Object Placement in Real 3D
Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual
prompt broadly describing where the 3D asset should be placed. The task here is
to find a valid placement for the 3D asset that respects the prompt. Compared
with other language-guided localization tasks in 3D scenes such as grounding,
this task has specific challenges: it is ambiguous because it has multiple
valid solutions, and it requires reasoning about 3D geometric relationships and
free space. We formalize this task by proposing a new benchmark and evaluation
protocol. We also introduce a new dataset for training 3D LLMs on this task, as
well as the first method to serve as a non-trivial baseline. We believe that
this challenging task and our new benchmark could become part of the suite of
benchmarks used to evaluate and compare generalist 3D LLM models.
comment: Tech report. Project page: https://nianticlabs.github.io/placeit3d/
☆ Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation
Humans naturally exhibit bilateral symmetry in their gross manipulation
skills, effortlessly mirroring simple actions between left and right hands.
Bimanual robots, which also feature bilateral symmetry, should similarly
exploit this property to perform tasks with either hand. Unlike humans, who often favor
a dominant hand for fine dexterous skills, robots should ideally execute
ambidextrous manipulation with equal proficiency. To this end, we introduce
SYMDEX (SYMmetric DEXterity), a reinforcement learning framework for
ambidextrous bi-manipulation that leverages the robot's inherent bilateral
symmetry as an inductive bias. SYMDEX decomposes complex bimanual manipulation
tasks into per-hand subtasks and trains dedicated policies for each. By
exploiting bilateral symmetry via equivariant neural networks, experience from
one arm is inherently leveraged by the opposite arm. We then distill the
subtask policies into a global ambidextrous policy that is independent of the
hand-task assignment. We evaluate SYMDEX on six challenging simulated
manipulation tasks and demonstrate successful real-world deployment on two of
them. Our approach strongly outperforms baselines on complex tasks in which the
left and right hands perform different roles. We further demonstrate SYMDEX's
scalability by extending it to a four-arm manipulation setup, where our
symmetry-aware policies enable effective multi-arm collaboration and
coordination. Our results highlight how structural symmetry as inductive bias
in policy learning enhances sample efficiency, robustness, and generalization
across diverse dexterous manipulation tasks.
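The bilateral-symmetry inductive bias described above amounts to mapping each transition to its mirror twin so experience from one arm also trains the other. A minimal sketch, assuming an illustrative state layout (3D vectors in the robot frame, symmetry across the x-z plane); SYMDEX's equivariant networks realize this inside the architecture rather than by explicit augmentation.

```python
import numpy as np

# Hypothetical mirroring of a transition across the robot's sagittal plane.
# The state layout and sign conventions are illustrative assumptions,
# not SYMDEX's actual implementation.

def mirror(vec):
    """Reflect a stack of 3D points/vectors across the x-z plane."""
    v = np.asarray(vec, dtype=float).reshape(-1, 3).copy()
    v[:, 1] *= -1.0  # negate the lateral (y) component
    return v.reshape(np.shape(vec))

def mirror_transition(obs_left, act_left):
    """Map a left-arm (observation, action) pair to its right-arm twin."""
    return mirror(obs_left), mirror(act_left)

obs = np.array([[0.3, 0.2, 0.5]])    # object position in the robot frame
act = np.array([[0.0, -0.1, 0.05]])  # commanded end-effector displacement
m_obs, m_act = mirror_transition(obs, act)
```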
☆ Multi-Objective Reinforcement Learning for Adaptive Personalized Autonomous Driving
Human drivers exhibit individual preferences regarding driving style.
Adapting autonomous vehicles to these preferences is essential for user trust
and satisfaction. However, existing end-to-end driving approaches often rely on
predefined driving styles or require continuous user feedback for adaptation,
limiting their ability to support dynamic, context-dependent preferences. We
propose a novel approach using multi-objective reinforcement learning (MORL)
with preference-driven optimization for end-to-end autonomous driving that
enables runtime adaptation to driving style preferences. Preferences are
encoded as continuous weight vectors to modulate behavior along interpretable
style objectives, including efficiency, comfort, speed, and
aggressiveness, without requiring policy retraining. Our
single-policy agent integrates vision-based perception in complex mixed-traffic
scenarios and is evaluated in diverse urban environments using the CARLA
simulator. Experimental results demonstrate that the agent dynamically adapts
its driving behavior according to changing preferences while maintaining
performance in terms of collision avoidance and route completion.
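The preference mechanism described above, continuous weight vectors modulating interpretable style objectives, is commonly realized as a scalarization of a vector-valued reward. A minimal sketch under assumed objective names and reward values; the paper's actual reward design is not specified here.

```python
import numpy as np

# Preference-conditioned scalarization: blend per-objective rewards with a
# continuous weight vector at runtime, without retraining the policy.
# Objective names and reward values are illustrative assumptions.

OBJECTIVES = ["efficiency", "comfort", "speed", "aggressiveness"]

def scalarize(reward_vec, weights):
    """Weighted sum of per-objective rewards under a normalized preference."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so preference vectors are comparable
    return float(np.dot(reward_vec, w))

r = np.array([0.8, 0.5, 0.2, 0.1])  # per-objective rewards at one step
comfort_first = scalarize(r, [0.1, 0.7, 0.1, 0.1])
speed_first = scalarize(r, [0.1, 0.1, 0.7, 0.1])
```

Changing the weight vector changes which behavior the same policy prefers, which is what enables runtime style adaptation.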
☆ Online Velocity Profile Generation and Tracking for Sampling-Based Local Planning Algorithms in Autonomous Racing Environments
This work presents an online velocity planner for autonomous racing that
adapts to changing dynamic constraints, such as grip variations from tire
temperature changes and rubber accumulation. The method combines a
forward-backward solver for online velocity optimization with a novel spatial
sampling strategy for local trajectory planning, utilizing a three-dimensional
track representation. The computed velocity profile serves as a reference for
the local planner, ensuring adaptability to environmental and vehicle dynamics.
We demonstrate the approach's robust performance and computational efficiency
in racing scenarios and discuss its limitations, including sensitivity to
deviations from the predefined racing line and high jerk characteristics of the
velocity profile.
comment: 8 Pages, accepted to be published at the IEEE Intelligent
Vehicles Symposium (IV 2025), June 22-25 in Cluj, Romania
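Forward-backward velocity solvers of the kind mentioned above share a standard structure: cap velocity pointwise by lateral grip, then sweep forward under an acceleration limit and backward under a braking limit. A generic sketch with illustrative grip, curvature, and discretization values; the paper's solver additionally adapts these constraints online.

```python
import math

# Generic forward-backward velocity pass over a discretized path.
# a_lat_max / a_lon_max / curvature values are illustrative assumptions.

def velocity_profile(ds, curvature, a_lat_max=9.0, a_lon_max=6.0, v_cap=90.0):
    n = len(curvature)
    # 1) pointwise cap from lateral grip: v^2 * |kappa| <= a_lat_max
    v = [min(v_cap, math.sqrt(a_lat_max / abs(k))) if k else v_cap
         for k in curvature]
    # 2) forward pass: limit acceleration, v_{i+1}^2 <= v_i^2 + 2*a*ds
    for i in range(n - 1):
        v[i + 1] = min(v[i + 1], math.sqrt(v[i] ** 2 + 2 * a_lon_max * ds))
    # 3) backward pass: limit braking, v_i^2 <= v_{i+1}^2 + 2*a*ds
    for i in range(n - 2, -1, -1):
        v[i] = min(v[i], math.sqrt(v[i + 1] ** 2 + 2 * a_lon_max * ds))
    return v

# straight -> corner -> straight; the profile brakes into and accelerates
# out of the curved section
prof = velocity_profile(ds=5.0, curvature=[0.0, 0.0, 0.05, 0.05, 0.0, 0.0])
```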
☆ X-Driver: Explainable Autonomous Driving with Vision-Language Models
End-to-end autonomous driving has advanced significantly, offering benefits
such as system simplicity and stronger driving performance in both open-loop
and closed-loop settings than conventional pipelines. However, existing
frameworks still suffer from low success rates in closed-loop evaluations,
highlighting their limitations in real-world deployment. In this paper, we
introduce X-Driver, a unified multi-modal large language model (MLLM)
framework designed for closed-loop autonomous driving, leveraging
Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and
decision-making. We validate X-Driver across multiple autonomous driving
tasks using public benchmarks in the CARLA simulation environment, including
Bench2Drive [6]. Our experimental results demonstrate superior closed-loop
performance, surpassing the current state-of-the-art (SOTA) while improving the
interpretability of driving decisions. These findings underscore the importance
of structured reasoning in end-to-end driving and establish X-Driver as a
strong baseline for future research in closed-loop autonomous driving.
★ The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes
Large-scale construction and demolition significantly challenge long-term
place recognition (PR) by drastically reshaping urban and suburban
environments. Existing datasets predominantly reflect limited or indoor-focused
changes, failing to adequately represent extensive outdoor transformations. To
bridge this gap, we introduce the City that Never Settles (CNS) dataset, a
simulation-based dataset created using the CARLA simulator, capturing major
structural changes, such as building construction and demolition, across diverse
maps and sequences. Additionally, we propose TCR_sym, a symmetric version of
the original TCR metric, enabling consistent measurement of structural changes
irrespective of source-target ordering. Quantitative comparisons demonstrate
that CNS encompasses more extensive transformations than current real-world
benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS
reveal substantial performance degradation, underscoring the need for robust
algorithms capable of handling significant environmental changes. Our dataset
is available at https://github.com/Hyunho111/CNS_dataset.
☆ Visual Affordances: Enabling Robots to Understand Object Functionality
Human-robot interaction for assistive technologies relies on the prediction
of affordances, which are the potential actions a robot can perform on objects.
Predicting object affordances from visual perception is formulated differently
for tasks such as grasping detection, affordance classification, affordance
segmentation, and hand-object interaction synthesis. In this work, we highlight
the reproducibility issue in these redefinitions, making comparative benchmarks
unfair and unreliable. To address this problem, we propose a unified
formulation for visual affordance prediction, provide a comprehensive and
systematic review of previous works highlighting strengths and limitations of
methods and datasets, and analyse what challenges reproducibility. To favour
transparency, we introduce the Affordance Sheet, a document to detail the
proposed solution, the datasets, and the validation. As the physical properties
of an object influence the interaction with the robot, we present a generic
framework that links visual affordance prediction to the physical world. Using
the weight of an object as an example for this framework, we discuss how
estimating object mass can affect the affordance prediction. Our approach
bridges the gap between affordance perception and robot actuation, and accounts
for the complete information about objects of interest and how the robot
interacts with them to accomplish its task.
comment: 24 pages, 12 figures, 10 tables. Project website at
https://apicis.github.io/aff-survey/
☆ CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations
Learning robot policies using imitation learning requires collecting large
amounts of costly action-labeled expert demonstrations, which fundamentally
limits the scale of training data. A promising approach to address this
bottleneck is to harness the abundance of unlabeled observations, e.g., from
video demonstrations, to learn latent action labels in an unsupervised way.
However, we find that existing methods struggle when applied to complex robot
tasks requiring fine-grained motions. We design continuous latent action models
(CLAM) which incorporate two key ingredients we find necessary for learning to
solve complex continuous control tasks from unlabeled observation data: (a)
using continuous latent action labels instead of discrete representations, and
(b) jointly training an action decoder to ensure that the latent action space
can be easily grounded to real actions with relatively few labeled examples.
Importantly, the labeled examples can be collected from non-optimal play data,
enabling CLAM to learn performant policies without access to any action-labeled
expert data. We demonstrate on continuous control benchmarks in DMControl
(locomotion) and MetaWorld (manipulation), as well as on a real WidowX robot
arm that CLAM significantly outperforms prior state-of-the-art methods,
remarkably with a 2-3x improvement in task success rate compared to the best
baseline. Videos and code can be found at clamrobot.github.io.
comment: Latent Action Models, Self-supervised Pretraining, Learning from
Videos
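The two ingredients named above, a continuous latent space and a jointly trained action decoder, give the model a simple structure: an inverse-dynamics encoder infers a continuous latent "action" from consecutive observations, and a decoder grounds latents to real actions using a small labeled set. In this sketch, linear maps stand in for the paper's neural networks and the dimensions are illustrative assumptions.

```python
import numpy as np

# Structural sketch of a continuous latent action model (CLAM-style).
# Linear maps replace neural networks; dimensions are assumptions.

rng = np.random.default_rng(0)
obs_dim, latent_dim, act_dim = 8, 4, 2

W_enc = rng.normal(scale=0.1, size=(latent_dim, 2 * obs_dim))  # inverse dynamics
W_dec = rng.normal(scale=0.1, size=(act_dim, latent_dim))      # action decoder

def encode(o_t, o_next):
    """Continuous latent action inferred from an unlabeled observation pair."""
    return W_enc @ np.concatenate([o_t, o_next])

def decode(z):
    """Ground a latent action to a real robot action (trained jointly,
    using relatively few action-labeled examples)."""
    return W_dec @ z

z = encode(rng.normal(size=obs_dim), rng.normal(size=obs_dim))
a = decode(z)
```

Because the latents are continuous rather than discrete codes, fine-grained motions are not quantized away, which is the paper's first key ingredient.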
☆ CPP-DIP: Multi-objective Coverage Path Planning for MAVs in Dispersed and Irregular Plantations
Coverage Path Planning (CPP) is vital in precision agriculture to improve
efficiency and resource utilization. In irregular and dispersed plantations,
traditional grid-based CPP often causes redundant coverage over non-vegetated
areas, leading to waste and pollution. To overcome these limitations, we
propose CPP-DIP, a multi-objective CPP framework designed for Micro Air
Vehicles (MAVs). The framework transforms the CPP task into a Traveling
Salesman Problem (TSP) and optimizes flight paths by minimizing travel
distance, turning angles, and intersection counts. Unlike conventional
approaches, our method does not rely on GPS-based environmental modeling.
Instead, it uses aerial imagery and a Histogram of Oriented Gradients
(HOG)-based approach to detect trees and extract image coordinates. A
density-aware waypoint strategy is applied: Kernel Density Estimation (KDE) is
used to reduce redundant waypoints in dense regions, while a greedy algorithm
ensures complete coverage in sparse areas. To verify the generality of the
framework, we solve the resulting TSP using three different methods: Greedy
Heuristic Insertion (GHI), Ant Colony Optimization (ACO), and Monte Carlo
Reinforcement Learning (MCRL). Then an object-based optimization is applied to
further refine the resulting path. Additionally, CPP-DIP integrates ForaNav,
our insect-inspired navigation method, for accurate tree localization and
tracking. The experimental results show that MCRL offers a balanced solution,
reducing the travel distance by 16.9% compared to ACO while maintaining a
similar performance to GHI. It also improves path smoothness by reducing
turning angles by 28.3% and 59.9% relative to ACO and GHI, respectively, and
effectively eliminates intersections. These results confirm the robustness and
effectiveness of CPP-DIP in different TSP solvers.
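The density-aware waypoint strategy described above can be sketched as a two-step rule: estimate local tree density with a Gaussian kernel, then thin waypoints inside dense clusters while keeping every isolated tree. The bandwidth, coverage radius, density threshold, and coordinates below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

# Density-aware waypoint thinning: KDE in dense regions, keep sparse trees.
# All numeric parameters are illustrative assumptions.

def kde_density(points, bandwidth=2.0):
    d = points[:, None, :] - points[None, :, :]
    sq = (d ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2)).sum(1)

def thin_waypoints(points, bandwidth=2.0, radius=1.5, dense_thresh=2.0):
    density = kde_density(points, bandwidth)
    keep, covered = [], np.zeros(len(points), dtype=bool)
    for i in np.argsort(-density):        # visit densest trees first
        if covered[i]:
            continue
        keep.append(i)
        if density[i] > dense_thresh:     # one waypoint covers the cluster
            near = ((points - points[i]) ** 2).sum(1) < radius ** 2
            covered |= near
        covered[i] = True
    return points[sorted(keep)]

trees = np.array([[0, 0], [0.5, 0.2], [0.3, 0.6],   # dense cluster
                  [10, 10], [20, 5]])               # isolated trees
wps = thin_waypoints(trees)
```

The cluster collapses to a single waypoint while both isolated trees survive, mirroring the KDE-plus-greedy split in the abstract.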
★ A Vehicle System for Navigating Among Vulnerable Road Users Including Remote Operation
Oscar de Groot, Alberto Bertipaglia, Hidde Boekema, Vishrut Jain, Marcell Kegl, Varun Kotian, Ted Lentsch, Yancong Lin, Chrysovalanto Messiou, Emma Schippers, Farzam Tajdari, Shiming Wang, Zimin Xia, Mubariz Zaffar, Ronald Ensing, Mario Garzon, Javier Alonso-Mora, Holger Caesar, Laura Ferranti, Riender Happee, Julian F. P. Kooij, Georgios Papaioannou, Barys Shyrokau, Dariu M. Gavrila
We present a vehicle system capable of navigating safely and efficiently
around Vulnerable Road Users (VRUs), such as pedestrians and cyclists. The
system comprises key modules for environment perception, localization and
mapping, motion planning, and control, integrated into a prototype vehicle. A
key innovation is a motion planner based on Topology-driven Model Predictive
Control (T-MPC). The guidance layer generates multiple trajectories in
parallel, each representing a distinct strategy for obstacle avoidance or
non-passing. The underlying trajectory optimization constrains the joint
probability of collision with VRUs under generic uncertainties. To address
extraordinary situations ("edge cases") that go beyond the autonomous
capabilities, such as construction zones or encounters with emergency
responders, the system includes an option for remote human operation,
supported by visual and haptic guidance. In simulation, our motion planner
outperforms three baseline approaches in terms of safety and efficiency. We
also demonstrate the full system in prototype vehicle tests on a closed track,
both in autonomous and remotely operated modes.
comment: Intelligent Vehicles Symposium 2025
☆ LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture
This paper proposes a novel Large Vision-Language Model (LVLM) and Model
Predictive Control (MPC) integration framework that delivers both task
scalability and safety for Autonomous Driving (AD). LVLMs excel at high-level
task planning across diverse driving scenarios. However, since these foundation
models are not specifically designed for driving and their reasoning does
not account for the feasibility of low-level motion planning, concerns remain
regarding safety and smooth task switching. This paper integrates LVLMs with
MPC Builder, which automatically generates MPCs on demand, based on symbolic
task commands generated by the LVLM, while ensuring optimality and safety. The
generated MPCs can strongly assist the execution or rejection of LVLM-driven
task switching by providing feedback on the feasibility of the given tasks and
generating task-switching-aware MPCs. Our approach provides a safe, flexible,
and adaptable control framework, bridging the gap between cutting-edge
foundation models and reliable vehicle operation. We demonstrate the
effectiveness of our approach through a simulation experiment, showing that our
system can safely and effectively handle highway driving while maintaining the
flexibility and adaptability of LVLMs.
comment: 8 pages, 8 figures
☆ Robust Model-Based In-Hand Manipulation with Integrated Real-Time Motion-Contact Planning and Tracking
Robotic dexterous in-hand manipulation, where multiple fingers dynamically
make and break contact, represents a step toward human-like dexterity in
real-world robotic applications. Unlike learning-based approaches that rely on
large-scale training or extensive data collection for each specific task,
model-based methods offer an efficient alternative: because they compute
solutions online, they can be readily applied to new tasks without extensive
retraining.
However, due to the complexity of physical contacts, existing model-based
methods encounter challenges in efficient online planning and handling modeling
errors, which limit their practical applications. To advance the effectiveness
and robustness of model-based contact-rich in-hand manipulation, this paper
proposes a novel integrated framework that mitigates these limitations. The
integration involves two key aspects: 1) integrated real-time planning and
tracking achieved by a hierarchical structure; and 2) joint optimization of
motions and contacts achieved by integrated motion-contact modeling.
Specifically, at the high level, finger motion and contact force references are
jointly generated using contact-implicit model predictive control. The
high-level module facilitates real-time planning and disturbance recovery. At
the low level, these integrated references are concurrently tracked using a
hand force-motion model and actual tactile feedback. The low-level module
compensates for modeling errors and enhances the robustness of manipulation.
Extensive experiments demonstrate that our approach outperforms existing
model-based methods in terms of accuracy, robustness, and real-time
performance. Our method successfully completes five challenging tasks in
real-world environments, even under appreciable external disturbances.
comment: Submitted to the International Journal of Robotics Research (IJRR)
☆ AI and Vision based Autonomous Navigation of Nano-Drones in Partially-Known Environments
The miniaturisation of sensors and processors, advancements in connected
edge intelligence, and surging interest in Artificial Intelligence are
accelerating the adoption of autonomous nano-sized drones in the Internet of
Robotic Things ecosystem. However, achieving safe autonomous navigation and
high-level tasks such as exploration and surveillance with these tiny platforms
is extremely challenging due to their limited resources. This work focuses on
enabling the safe and autonomous flight of a pocket-size, 30-gram platform
called Crazyflie 2.1 in a partially known environment. We propose a novel
AI-aided, vision-based reactive planning method for obstacle avoidance
within the Integrated Sensing, Computing and Communication paradigm. We deal
with the constraints of the nano-drone by splitting the navigation task into
two parts: a deep learning-based object detector runs on the edge (external
hardware) while the planning algorithm is executed onboard. The results show
the ability to command the drone at $\sim8$ frames-per-second and a model
performance reaching a COCO mean-average-precision of $60.8$. Field experiments
demonstrate the feasibility of the solution with the drone flying at a top
speed of $1$ m/s while steering away from an obstacle placed in an unknown
position and reaching the target destination. The outcome highlights the
compatibility of the communication delay and the model performance with the
requirements of the real-time navigation task. We provide a feasible
alternative to a fully onboard implementation that can be extended to
autonomous exploration with nano-drones.
comment: in DCOSS-IoT 2025, Wi-DroIT 2025
☆ An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects IROS 2022
The system proposed in this paper addresses a use case that requires the
autonomous picking of cuboidal objects from an organized or unorganized pile
with high precision. We present an efficient method for precise pose
estimation of cuboid-shaped objects, which aims to reduce target-pose errors
in a time-efficient manner. Typical pose estimation
methods like global point cloud registrations are prone to minor pose errors
for which local registration algorithms are generally used to improve pose
accuracy. However, due to the execution time overhead and uncertainty in the
error of the final achieved pose, an alternate, linear time approach is
proposed for pose error estimation and correction. This paper presents an
overview of the solution followed by a detailed description of individual
modules of the proposed algorithm.
comment: Accepted in IEEE/RSJ IROS 2022 Workshop on Mobile Manipulation and
Embodied Intelligence (MOMA)
☆ ADD: Physics-Based Motion Imitation with Adversarial Differential Discriminators
Multi-objective optimization problems, which require the simultaneous
optimization of multiple terms, are prevalent across numerous applications.
Existing multi-objective optimization methods often rely on manually tuned
aggregation functions to formulate a joint optimization target. The performance
of such hand-tuned methods is heavily dependent on careful weight selection, a
time-consuming and laborious process. These limitations also arise in the
setting of reinforcement-learning-based motion tracking for physically
simulated characters, where intricately crafted reward functions are typically
used to achieve high-fidelity results. Such solutions not only require domain
expertise and significant manual adjustment, but also limit the applicability
of the resulting reward function across diverse skills. To bridge this gap, we
present a novel adversarial multi-objective optimization technique that is
broadly applicable to a range of multi-objective optimization problems,
including motion tracking. The proposed adversarial differential discriminator
receives a single positive sample, yet is still effective at guiding the
optimization process. We demonstrate that our technique can enable characters
to closely replicate a variety of acrobatic and agile behaviors, achieving
comparable quality to state-of-the-art motion-tracking methods, without relying
on manually tuned reward functions. Results are best visualized through
https://youtu.be/rz8BYCE9E2w.
comment: 19 pages, 15 figures
☆ Real-Time Model Predictive Control of Vehicles with Convex-Polygon-Aware Collision Avoidance in Tight Spaces
This paper proposes vehicle motion planning methods with obstacle avoidance
in tight spaces by incorporating polygonal approximations of both the vehicle
and obstacles into a model predictive control (MPC) framework. Representing
these shapes is crucial for navigation in tight spaces to ensure accurate
collision detection. However, incorporating polygonal approximations leads to
disjunctive OR constraints in the MPC formulation, which require
mixed-integer programming and incur significant computational cost. To overcome this,
we propose two different collision-avoidance constraints that reformulate the
disjunctive OR constraints as tractable conjunctive AND constraints: (1) a
Support Vector Machine (SVM)-based formulation that recasts collision avoidance
as a SVM optimization problem, and (2) a Minimum Signed Distance to Edges
(MSDE) formulation that leverages minimum signed-distance metrics. We validate
both methods through extensive simulations, including tight-space parking
scenarios and varied-shape obstacle courses, as well as hardware experiments on
an RC-car platform. Our results demonstrate that the SVM-based approach
achieves superior navigation accuracy in constrained environments; the MSDE
approach, by contrast, runs in real time with only a modest reduction in
collision-avoidance performance.
comment: 8 pages, 10 figures, 3 tables, The IEEE International Conference on
Intelligent Transportation Systems (ITSC) November 18-21, 2025-Gold Coast,
Australia
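The SVM view of collision avoidance mentioned above rests on a simple fact: two convex polygons do not intersect exactly when their vertex sets are linearly separable, so a hyperplane w^T x + b = 0 certifies collision-freedom. The toy below finds such a hyperplane with a perceptron, a stand-in for the paper's actual SVM formulation; the polygon vertices are illustrative assumptions.

```python
import numpy as np

# Separating-hyperplane certificate for two convex polygons.
# A perceptron stands in for the paper's SVM optimization; vertices are
# illustrative assumptions.

def separating_hyperplane(A, B, iters=1000, lr=0.1):
    """Return (w, b) with w.x+b > 0 on A and < 0 on B, or None."""
    X = np.vstack([A, B])
    y = np.array([1.0] * len(A) + [-1.0] * len(B))
    w, b = np.zeros(2), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        bad = margins <= 0
        if not bad.any():
            return w, b                  # every vertex strictly separated
        i = int(np.argmax(bad))          # perceptron update on a violator
        w += lr * y[i] * X[i]
        b += lr * y[i]
    return None                          # no separator found: possible overlap

ego = np.array([[0, 0], [2, 0], [2, 1], [0, 1]])       # ego footprint
obstacle = np.array([[4, 0], [5, 0], [5, 1], [4, 1]])  # nearby obstacle
plane = separating_hyperplane(ego, obstacle)
```

Embedding the hyperplane parameters as decision variables is what turns the disjunctive OR constraints into conjunctive AND constraints in the MPC.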
☆ CubeDAgger: Improved Robustness of Interactive Imitation Learning without Violation of Dynamic Stability
Interactive imitation learning makes an agent's control policy robust by
stepwise supervision from an expert. Recent algorithms mostly employ
expert-agent switching systems that reduce the expert's burden by selectively
timing the supervision. However, precise timing is difficult, and such
switching causes abrupt changes in actions that damage dynamic
stability. This paper therefore proposes a novel method, called CubeDAgger,
which improves robustness while reducing dynamic stability violations by making
three improvements to a baseline method, EnsembleDAgger. The first improvement
adds a regularization to explicitly activate the threshold for deciding the
supervision timing. The second transforms the expert-agent switching system to
an optimal consensus system over multiple action candidates. Third,
autoregressive colored noise is added to the actions to make the stochastic
exploration consistent over time. These improvements are verified by
simulations, showing that the learned policies are sufficiently robust while
maintaining dynamic stability during interaction.
comment: 7 pages, 4 figures
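The third improvement, autoregressive colored noise, replaces independent per-step noise with a first-order AR process so exploration stays consistent over time. A minimal sketch; the AR coefficient, scale, and dimensions are illustrative assumptions.

```python
import numpy as np

# Autoregressive (colored) exploration noise:
#   n_t = phi * n_{t-1} + sqrt(1 - phi^2) * eps_t
# The sqrt(1 - phi^2) factor keeps the stationary variance at sigma^2.
# phi, sigma, and dimensions are illustrative assumptions.

def colored_noise(steps, dim, phi=0.9, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = np.zeros(dim)
    out = []
    for _ in range(steps):
        eps = rng.normal(scale=sigma, size=dim)
        n = phi * n + np.sqrt(1 - phi ** 2) * eps
        out.append(n.copy())
    return np.array(out)

noise = colored_noise(steps=500, dim=2)
```

Unlike white noise, consecutive samples are strongly correlated, so the perturbed actions change smoothly rather than abruptly.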
☆ SatAOI: Delimitating Area of Interest for Swing-Arm Troweling Robot for Construction
In concrete troweling for building construction, robots can significantly
reduce workload and improve automation level. However, as a primary task of
coverage path planning (CPP) for troweling, delimitating area of interest (AOI)
in complex scenes is still challenging, especially for swing-arm robots with
more complex working modes. Thus, this research proposes an algorithm to
delimitate the AOI for swing-arm troweling robots (the SatAOI algorithm). By analyzing
characteristics of the robot and obstacle maps, mathematical models and
collision principles are established. On this basis, SatAOI algorithm achieves
AOI delimitation by global search and collision detection. Experiments on
different obstacle maps indicate that the AOI can be effectively delimitated
in scenes of varying complexity, and that the algorithm fully accounts for
the connectivity of obstacle maps. This research serves as a foundation for
CPP algorithms and full-process simulation of swing-arm troweling robots.
☆ D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation
Learning bimanual manipulation is challenging due to its high dimensionality
and tight coordination required between two arms. Eye-in-hand imitation
learning, which uses wrist-mounted cameras, simplifies perception by focusing
on task-relevant views. However, collecting diverse demonstrations remains
costly, motivating the need for scalable data augmentation. While prior work
has explored visual augmentation in single-arm settings, extending these
approaches to bimanual manipulation requires generating viewpoint-consistent
observations across both arms and producing corresponding action labels that
are both valid and feasible. In this work, we propose Diffusion for COordinated
Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation
tailored to eye-in-hand bimanual imitation learning that trains a diffusion
model to synthesize novel, viewpoint-consistent wrist-camera images for both
arms while simultaneously generating joint-space action labels. It employs
constrained optimization to ensure that augmented states involving
gripper-to-object contacts adhere to constraints suitable for bimanual
coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our
results across 2250 simulation trials and 300 real-world trials demonstrate
that it outperforms baselines and ablations, showing its potential for scalable
data augmentation in eye-in-hand bimanual manipulation. Our project website is
at: https://dcodaaug.github.io/D-CODA/.
♻ ☆ Efficient Estimation of Relaxed Model Parameters for Robust UAV Trajectory Optimization
Online trajectory optimization and optimal control methods are crucial for
enabling sustainable unmanned aerial vehicle (UAV) services, such as
agriculture, environmental monitoring, and transportation, where available
actuation and energy are limited. However, optimal controllers are highly
sensitive to model mismatch, which can occur due to loaded equipment, packages
to be delivered, or pre-existing variability in fundamental structural and
thrust-related parameters. To circumvent this problem, optimal controllers can
be paired with parameter estimators to improve their trajectory planning
performance and perform adaptive control. However, UAV platforms are limited in
terms of onboard processing power, oftentimes making nonlinear parameter
estimation too computationally expensive to consider. To address these issues,
we propose a relaxed, affine-in-parameters multirotor model along with an
efficient optimal parameter estimator. We convexify the nominal Moving Horizon
Parameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via
an affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast
quadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC)
in real time. We compare this approach to the equivalent nonlinear estimator in
Monte Carlo simulations, demonstrating a decrease in average solve time and
trajectory optimality cost by 98.2% and 23.9-56.2%, respectively.
comment: 8 pages, 5 figures, to be published in IEEE Sustech 2025
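The affine-in-parameter relaxation described above turns each horizon's estimation problem into a linear least-squares solve. The sketch below is a minimal illustration under the assumption that the relaxed dynamics take the form x_{k+1} ≈ f0(x_k, u_k) + Φ(x_k, u_k)·θ; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def lq_mhpe_estimate(regressors, residuals, theta_prior, weight=1.0):
    """Sketch of a linear-quadratic moving-horizon parameter estimate.

    Assumes dynamics relaxed to x_{k+1} ~= f0(x_k, u_k) + Phi(x_k, u_k) @ theta,
    so stacking the regressors Phi_k and residuals y_k = x_{k+1} - f0(x_k, u_k)
    over the horizon yields a least-squares problem with a prior term.
    """
    Phi = np.vstack(regressors)       # stacked regressor matrix, (N*n) x p
    y = np.concatenate(residuals)     # stacked measurement residuals
    p = theta_prior.shape[0]
    # Normal equations of  min ||Phi th - y||^2 + weight * ||th - theta_prior||^2
    H = Phi.T @ Phi + weight * np.eye(p)
    g = Phi.T @ y + weight * theta_prior
    return np.linalg.solve(H, g)
```

Because the cost is quadratic in θ, the per-horizon solve is a single small linear system, which is what makes this formulation attractive on compute-limited UAV hardware.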
♻ ☆ Uncertainty Comes for Free: Human-in-the-Loop Policies with Diffusion Models
Human-in-the-loop (HitL) robot deployment has gained significant attention in
both academia and industry as a semi-autonomous paradigm that enables human
operators to intervene and adjust robot behaviors at deployment time, improving
success rates. However, continuous human monitoring and intervention can be
highly labor-intensive and impractical when deploying a large number of robots.
To address this limitation, we propose a method that allows diffusion policies
to actively seek human assistance only when necessary, reducing reliance on
constant human oversight. To achieve this, we leverage the generative process
of diffusion policies to compute an uncertainty metric, on the basis of which
the autonomous agent can decide to request operator assistance at deployment
time, without requiring any operator interaction during training. Additionally, we
show that the same method can be used for efficient data collection for
fine-tuning diffusion policies in order to improve their autonomous
performance. Experimental results from simulated and real-world environments
demonstrate that our approach enhances policy performance during deployment for
a variety of scenarios.
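One natural way to realize such an uncertainty metric is to draw several action samples from the diffusion policy for the same observation and measure their dispersion. The sketch below shows this idea with a simple mean-distance statistic; the paper's exact metric and thresholding scheme may differ, and all names here are illustrative.

```python
import numpy as np

def should_request_help(sample_actions, threshold):
    """Decide whether to hand control to a human operator.

    `sample_actions` is a (K, action_dim) array of actions sampled from the
    diffusion policy's generative process for one observation. The mean
    distance of the samples to their mean serves as a stand-in uncertainty
    metric: tightly clustered samples suggest a confident policy, widely
    spread samples suggest the robot should request assistance.
    """
    mean_action = sample_actions.mean(axis=0)
    dispersion = np.linalg.norm(sample_actions - mean_action, axis=1).mean()
    return bool(dispersion > threshold), dispersion
```

The same dispersion score can also rank states for data collection: states where the policy is most uncertain are the most informative ones to demonstrate for fine-tuning.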
♻ ☆ An End-to-End Framework for Optimizing Foot Trajectory and Force in Dry Adhesion Legged Wall-Climbing Robots
Foot trajectory planning for dry adhesion legged climbing robots presents
challenges, as the phases of foot detachment, swing, and adhesion significantly
influence the adhesion and detachment forces essential for stable climbing. To
tackle this, an end-to-end foot trajectory and force optimization framework
(FTFOF) is proposed, which optimizes foot adhesion and detachment forces
through trajectory adjustments. This framework accepts general foot trajectory
constraints and user-defined parameters as input, ultimately producing an
optimal single foot trajectory. It integrates three-segment $C^2$ continuous
Bezier curves, tailored to various foot structures, enabling the generation of
effective climbing trajectories. A dilate-based GRU predictive model
establishes the relationship between foot trajectories and the corresponding
foot forces. Multi-objective optimization algorithms, combined with a
redundancy hierarchical strategy, identify the most suitable foot trajectory
for specific tasks, thereby ensuring optimal performance across detachment
force, adhesion force and vibration amplitude. Experimental validation on the
quadruped climbing robot MST-M3F showed that, compared to trajectories commonly
used in existing legged climbing robots, the proposed framework reduced the
maximum detachment force by 28% and the vibration amplitude by 82%, ensuring
stable climbing for dry adhesion legged climbing robots.
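Chaining Bezier segments with C^2 continuity, as the three-segment trajectories above require, fixes the first three control points of each new segment from the previous one. The sketch below shows the standard construction for cubic segments on equal parameter intervals; it illustrates the continuity conditions only, not the paper's full optimization.

```python
import numpy as np

def c2_continuation(P):
    """Given cubic Bezier control points P = [P0, P1, P2, P3], return the
    first three control points [Q0, Q1, Q2] of the next segment so the
    junction is C^2 continuous (equal parameter intervals assumed)."""
    P0, P1, P2, P3 = P
    Q0 = P3                      # C^0: endpoint positions match
    Q1 = 2 * P3 - P2             # C^1: first derivatives match at the joint
    Q2 = P1 - 4 * P2 + 4 * P3    # C^2: second derivatives match at the joint
    return np.array([Q0, Q1, Q2])
```

Only the last control point of each continuation segment remains free, which is what the trajectory optimizer can then adjust to shape adhesion and detachment forces.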
♻ ☆ Fast Whole-Body Strain Regulation in Continuum Robots
We take steps toward real-time strain control of multiphysics, multiscale
continuum soft robots. To study this problem
fundamentally, we ground ourselves in a model-based control setting enabled by
mathematically precise dynamics of a soft robot prototype. Poised to integrate,
rather than reject, inherent mechanical nonlinearities for embodied compliance,
we first separate the original robot dynamics into separate subdynamics --
aided by a perturbing time-scale separation parameter. Second, we prescribe a
set of stabilizing nonlinear backstepping controllers for regulating the
resulting subsystems' strain dynamics. Third, we study the interconnected
singularly perturbed system by analyzing and establishing its stability.
Fourth, our theory is backed by fast numerical results on a single arm of the
Octopus robot. We demonstrate strain regulation to equilibrium of the
whole-body reduced-order dynamics of an infinite-degrees-of-freedom soft robot
in significantly reduced time. This paper communicates our thinking
within the backdrop of embodied intelligence: it informs our conceptualization,
formulation, computational setup, and yields improved control performance for
infinite degrees-of-freedom soft robots.
♻ ☆ A Machine Learning Approach to Sensor Substitution from Tactile Sensing to Visual Perception for Non-Prehensile Manipulation
Mobile manipulators are increasingly deployed in complex environments,
requiring diverse sensors to perceive and interact with their surroundings.
However, equipping every robot with every possible sensor is often impractical
due to cost and physical constraints. A critical challenge arises when robots
with differing sensor capabilities need to collaborate or perform similar
tasks. For example, consider a scenario where a mobile manipulator equipped
with high-resolution tactile skin is skilled at non-prehensile manipulation
tasks like pushing. If this robot needs to be replaced or augmented by a robot
lacking such tactile sensing, the learned manipulation policies become
inapplicable. This paper addresses the problem of sensor substitution in
non-prehensile manipulation. We propose a novel machine learning-based
framework that enables a robot with a limited sensor set (e.g., LiDAR or RGB-D)
to effectively perform tasks previously reliant on a richer sensor suite (e.g.,
tactile skin). Our approach learns a mapping between the available sensor data
and the information provided by the substituted sensor, effectively
synthesizing the missing sensory input. Specifically, we demonstrate the
efficacy of our framework by training a model to substitute tactile skin data
for the task of non-prehensile pushing using a mobile manipulator. We show that
a manipulator equipped only with LiDAR or RGB-D can, after training, achieve
comparable and sometimes even better pushing performance to a mobile base
utilizing direct tactile feedback.
comment: 10 pages, 6 figures, submitted to IEEE Sensors Journal, for
associated video, see https://youtu.be/6yIRcfn2DsY
♻ ☆ Demonstrating ViSafe: Vision-enabled Safety for High-speed Detect and Avoid RSS 2025
Parv Kapoor, Ian Higgins, Nikhil Keetha, Jay Patrikar, Brady Moon, Zelin Ye, Yao He, Ivan Cisneros, Yaoyu Hu, Changliu Liu, Eunsuk Kang, Sebastian Scherer
Assured safe-separation is essential for achieving seamless high-density
operation of airborne vehicles in a shared airspace. To equip
resource-constrained aerial systems with this safety-critical capability, we
present ViSafe, a high-speed vision-only airborne collision avoidance system.
ViSafe offers a full-stack solution to the Detect and Avoid (DAA) problem by
tightly integrating a learning-based edge-AI framework with a custom
multi-camera hardware prototype designed under SWaP-C constraints. By
leveraging perceptual input-focused control barrier functions (CBF) to design,
encode, and enforce safety thresholds, ViSafe can provide provably safe runtime
guarantees for self-separation in high-speed aerial operations. We evaluate
ViSafe's performance through an extensive test campaign involving both
simulated digital twins and real-world flight scenarios. By independently
varying agent types, closure rates, interaction geometries, and environmental
conditions (e.g., weather and lighting), we demonstrate that ViSafe
consistently ensures self-separation across diverse scenarios. In
first-of-its-kind real-world high-speed collision avoidance tests with closure
rates reaching 144 km/h, ViSafe sets a new benchmark for vision-only autonomous
collision avoidance, establishing a new standard for safety in high-speed
aerial navigation.
comment: 13 pages, RSS 2025 Demo track, https://theairlab.org/visafe/
♻ ☆ Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, Yadan Luo
End-to-end autonomous driving frameworks enable seamless integration of
perception and planning but often rely on one-shot trajectory prediction, which
may lead to unstable control and vulnerability to occlusions in single-frame
perception. To address this, we propose the Momentum-Aware Driving (MomAD)
framework, which introduces trajectory momentum and perception momentum to
stabilize and refine trajectory predictions. MomAD comprises two core
components: (1) Topological Trajectory Matching (TTM) employs Hausdorff
Distance to select the optimal planning query that aligns with prior paths to
ensure coherence; (2) Momentum Planning Interactor (MPI) cross-attends the
selected planning query with historical queries to expand static and dynamic
perception files. This enriched query, in turn, helps regenerate the
long-horizon trajectory and reduce collision risks. To mitigate noise arising from dynamic
environments and detection errors, we introduce robust instance denoising
during training, enabling the planning model to focus on critical signals and
improve its robustness. We also propose a novel Trajectory Prediction
Consistency (TPC) metric to quantitatively assess planning stability.
Experiments on the nuScenes dataset demonstrate that MomAD achieves superior
long-term consistency (>=3s) compared to SOTA methods. Moreover, evaluations on
the curated Turning-nuScenes dataset show that MomAD reduces the collision rate
by 26% and improves TPC by 0.97m (33.45%) over a 6s prediction horizon, while
closed-loop evaluation on Bench2Drive demonstrates up to a 16.3% improvement in
success rate.
comment: 16 pages, 8 figures
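The TTM step above scores each candidate planning trajectory against the previously executed path using the Hausdorff distance and keeps the closest one. A minimal sketch of that selection, with illustrative names (the paper's query selection operates on learned planning queries, not raw point sets):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two 2-D point sets (N,2), (M,2)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def match_trajectory(candidates, prior_path):
    """Pick the candidate trajectory closest (in Hausdorff distance) to the
    previously executed path -- a TTM-style momentum-matching sketch."""
    dists = [hausdorff(c, prior_path) for c in candidates]
    return int(np.argmin(dists))
```

Selecting the candidate nearest to the prior path is what suppresses frame-to-frame trajectory jumps ("shaking the wheel") in the one-shot prediction setting.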
♻ ☆ SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Efficient path planning in robotics, particularly within large-scale, dynamic
environments, remains a significant hurdle. While Large Language Models (LLMs)
offer strong reasoning capabilities, their high computational cost and limited
adaptability in dynamic scenarios hinder real-time deployment on edge devices.
We present SmallPlan -- a novel framework leveraging LLMs as teacher models to
train lightweight Small Language Models (SLMs) for high-level path planning
tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate
across scene graphs that compactly represent full-scale 3D scenes. The SLMs
are trained in a simulation-powered, interleaved manner with LLM-guided
supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not
only enables SLMs to successfully complete navigation tasks but also makes them
aware of important factors like travel distance and number of trials. Through
experiments, we demonstrate that the fine-tuned SLMs perform competitively with
larger models like GPT-4o on sequential path planning, without suffering from
hallucination and overfitting. SmallPlan is resource-efficient, making it
well-suited for edge-device deployment and advancing practical autonomous
robotics.
comment: Paper is under review
♻ ☆ A highly maneuverable flying squirrel drone with agility-improving foldable wings
Drones, like most aerial vehicles, face inherent disadvantages in
achieving agile flight due to their limited thrust capabilities. These physical
constraints cannot be fully addressed through advancements in control
algorithms alone. Drawing inspiration from the winged flying squirrel, this
paper proposes a highly maneuverable drone equipped with agility-enhancing
foldable wings. By leveraging collaborative control between the conventional
propeller system and the foldable wings, coordinated through the Thrust-Wing
Coordination Control (TWCC) framework, the controllable acceleration set is
expanded, enabling the generation of abrupt vertical forces that are
unachievable with traditional wingless drones. The complex aerodynamics of the
foldable wings are modeled using a physics-assisted recurrent neural network
(paRNN), which calibrates the angle of attack (AOA) to align with the real
aerodynamic behavior of the wings. The additional air resistance generated by
appropriately deploying these wings significantly improves the tracking
performance of the proposed "flying squirrel" drone. The model is trained on
real flight data and incorporates flat-plate aerodynamic principles.
Experimental results demonstrate that the proposed flying squirrel drone
achieves a 13.1% improvement in tracking performance, as measured by root mean
square error (RMSE), compared to a conventional wingless drone. A demonstration
video is available on YouTube: https://youtu.be/O8nrip18azY.
comment: Accepted to IEEE Robotics and Automation Letters. Project Page :
https://jgkang1210.github.io/fsdrone_ral/ , Video :
https://www.youtube.com/watch?v=tckIF3KCJig , Dohyeon Lee and Jun-Gill Kang
are co-authors
♻ ☆ CloudTrack: Scalable UAV Tracking with Cloud Semantics
Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and
rescue scenarios to gather information in the search area. The automatic
identification of the person searched for in aerial footage could increase the
autonomy of such systems, reduce the search time, and thus increase the missing
person's chances of survival. In this paper, we present a novel approach to
perform semantically conditioned open vocabulary object tracking that is
specifically designed to cope with the limitations of UAV hardware. Our
approach has several advantages. It can run with verbal descriptions of the
missing person, e.g., the color of the shirt, it does not require dedicated
training to execute the mission and can efficiently track a potentially moving
person. Our experimental results demonstrate the versatility and efficacy of
our approach.
comment: 7 pages, 3 figures
♻ ☆ REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining
Abu Mohammed Raisuddin, Jesper Holmblad, Hamed Haghighi, Yuri Poledna, Maikol Funk Drechsler, Valentina Donzella, Eren Erdal Aksoy
Sensor degradation poses a significant challenge in autonomous driving.
During heavy rainfall, the interference from raindrops can adversely affect the
quality of LiDAR point clouds, resulting in, for instance, inaccurate point
measurements. This, in turn, can potentially lead to safety concerns if
autonomous driving systems are not weather-aware, i.e., if they are unable to
discern such changes. In this study, we release a new, large-scale, multi-modal
emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D
point cloud de-raining. Distinct from the most relevant competitors, our
dataset is unique in several respects. First, it is the largest point-wise
annotated dataset, and second, it is the only one with high-resolution LiDAR
data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and
nighttime conditions in a controlled weather environment. Furthermore,
REHEARSE-3D involves rain-characteristic information, which is of significant
value not only for sensor noise modeling but also for analyzing the impact of
weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop
detection and removal in fused LiDAR and 4D Radar point clouds. Our
comprehensive study further evaluates the performance of various statistical
and deep-learning models. Upon publication, the dataset and benchmark models
will be made publicly available at: https://sporsho.github.io/REHEARSE3D.
♻ ☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing,
providing a framework that allows a unified and compact approach to geometric
relations which in other mathematical systems are typically described using
different more complicated elements. This fact has led to an increasing
adoption of GA in applied mathematics and engineering problems. However, the
scarcity of symbolic implementations of GA and its inherent complexity,
requiring a specific mathematical background, make it challenging and less
intuitive for engineers to work with. This prevents wider adoption among more
applied professionals. To address this challenge, this paper introduces SUGAR
(Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox
designed for Matlab and licensed under the MIT License. SUGAR facilitates the
translation of GA concepts into Matlab and provides a collection of
user-friendly functions tailored for GA computations, including support for
symbolic operations. It supports both numeric and symbolic computations in
high-dimensional GAs. Specifically tailored for applied mathematics and
engineering applications, SUGAR has been meticulously engineered to represent
geometric elements and transformations within two and three-dimensional
projective and conformal geometric algebras, aligning with established
computational methodologies in the literature. Furthermore, SUGAR efficiently
handles functions of multivectors, such as exponential, logarithmic,
sinusoidal, and cosine functions, enhancing its applicability across various
engineering domains, including robotics, control systems, and power
electronics. Finally, this work includes four distinct validation examples,
demonstrating SUGAR's capabilities across the above-mentioned fields and its
practical utility in addressing real-world applied mathematics and engineering
problems.
comment: 33 pages, 6 figures, journal paper accepted in ACM TOMS
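The unifying object SUGAR manipulates is the multivector, and its basic operation is the geometric product. The tiny Python sketch below (not Matlab, and not SUGAR's API) illustrates the product of two vectors in the plane algebra G(2), which splits into an inner product (scalar part) and an outer product (bivector part):

```python
def geometric_product_2d(a, b):
    """Geometric product of two 2D vectors a = (ax, ay), b = (bx, by) in G(2).

    Returns (scalar, bivector): the grade-0 part a . b and the coefficient of
    e1^e2 in the grade-2 part a ^ b. Illustrative only -- SUGAR handles
    arbitrary-dimensional, symbolic multivectors in Matlab."""
    ax, ay = a
    bx, by = b
    scalar = ax * bx + ay * by       # inner product, grade 0
    bivector = ax * by - ay * bx     # outer product, grade 2 (e1^e2)
    return scalar, bivector
```

Orthogonal vectors yield a pure bivector and parallel vectors a pure scalar, which is the compact encoding of both angle and orientation that makes GA attractive for robotics and control.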
♻ ☆ FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
Geometrically accurate and semantically expressive map representations have
proven invaluable to facilitate robust and safe mobile robot navigation and
task planning. Nevertheless, real-time, open-vocabulary semantic understanding
of large-scale unknown environments is still an open problem. In this paper we
present FindAnything, an open-world mapping and exploration framework that
incorporates vision-language information into dense volumetric submaps. Thanks
to the use of vision-language features, FindAnything bridges the gap between
pure geometric and open-vocabulary semantic information for a higher level of
understanding, while allowing the robot to explore any environment without any
external source of ground-truth pose information. We represent the environment
as a series of volumetric occupancy submaps, resulting in a robust and accurate
map representation that deforms upon pose updates when the underlying SLAM
system corrects its drift, allowing for a locally consistent representation
between submaps. Pixel-wise vision-language features are aggregated from
efficient SAM (eSAM)-generated segments, which are in turn integrated into
object-centric volumetric submaps, providing a mapping from open-vocabulary
queries to 3D geometry that is scalable also in terms of memory usage. The
open-vocabulary map representation of FindAnything achieves state-of-the-art
semantic accuracy in closed-set evaluations on the Replica dataset. This level
of scene understanding allows a robot to explore environments based on objects
or areas of interest selected via natural language queries. Our system is the
first of its kind to be deployed on resource-constrained devices, such as MAVs,
leveraging vision-language information for real-world robotic tasks.
comment: 11 pages, 5 figures
♻ ☆ SATA: Safe and Adaptive Torque-Based Locomotion Policies Inspired by Animal Learning
Peizhuo Li, Hongyi Li, Ge Sun, Jin Cheng, Xinrong Yang, Guillaume Bellegarda, Milad Shafiee, Yuhong Cao, Auke Ijspeert, Guillaume Sartoretti
Despite recent advances in learning-based controllers for legged robots,
deployments in human-centric environments remain limited by safety concerns.
Most of these approaches use position-based control, where policies output
target joint angles that must be processed by a low-level controller (e.g., PD
or impedance controllers) to compute joint torques. Although impressive results
have been achieved in controlled real-world scenarios, these methods often
struggle with compliance and adaptability when encountering environments or
disturbances unseen during training, potentially resulting in extreme or unsafe
behaviors. Inspired by how animals achieve smooth and adaptive movements by
controlling muscle extension and contraction, torque-based policies offer a
promising alternative by enabling precise and direct control of the actuators
in torque space. In principle, this approach facilitates more effective
interactions with the environment, resulting in safer and more adaptable
behaviors. However, challenges such as a highly nonlinear state space and
inefficient exploration during training have hindered their broader adoption.
To address these limitations, we propose SATA, a bio-inspired framework that
mimics key biomechanical principles and adaptive learning mechanisms observed
in animal locomotion. Our approach effectively addresses the inherent
challenges of learning torque-based policies by significantly improving
early-stage exploration, leading to high-performance final policies.
Remarkably, our method achieves zero-shot sim-to-real transfer. Our
experimental results indicate that SATA demonstrates remarkable compliance and
safety, even in challenging environments such as soft/slippery terrain or
narrow passages, and under significant external disturbances, highlighting its
potential for practical deployments in human-centric and safety-critical
scenarios.
♻ ☆ LUDO: Low-Latency Understanding of Deformable Objects using Point Cloud Occupancy Functions
Accurately determining the shape of objects and the location of their
internal structures within deformable objects is crucial for medical tasks that
require precise targeting, such as robotic biopsies. We introduce LUDO, a
method for accurate low-latency understanding of deformable objects. LUDO
reconstructs objects in their deformed state, including their internal
structures, from a single-view point cloud observation in under 30 ms using
occupancy networks. LUDO provides uncertainty estimates for its predictions.
Additionally, it provides explainability by highlighting key features in its
input observations. Both uncertainty and explainability are important for
safety-critical applications such as surgical interventions. We demonstrate
LUDO's abilities for autonomous targeting of internal regions of interest
(ROIs) in deformable objects. We evaluate LUDO in real-world robotic
experiments, achieving a success rate of 98.9% for puncturing various ROIs
inside deformable objects. LUDO demonstrates the potential to interact with
deformable objects without the need for deformable registration methods.
♻ ☆ An Efficient GPU-based Implementation for Noise Robust Sound Source Localization
Zirui Lin, Masayuki Takigahira, Naoya Terakado, Haris Gulzar, Monikka Roslianna Busto, Takeharu Eda, Katsutoshi Itoyama, Kazuhiro Nakadai, Hideharu Amano
Robot audition, encompassing Sound Source Localization (SSL), Sound Source
Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and
smart devices to acquire auditory capabilities similar to human hearing.
Despite their wide applicability, processing multi-channel audio signals from
microphone arrays in SSL involves computationally intensive matrix operations,
which can hinder efficient deployment on Central Processing Units (CPUs),
particularly in embedded systems with limited CPU resources. This paper
introduces a GPU-based implementation of SSL for robot audition, utilizing the
Generalized Singular Value Decomposition-based Multiple Signal Classification
(GSVD-MUSIC), a noise-robust algorithm, within the HARK platform, an
open-source software suite. For a 60-channel microphone array, the proposed
implementation achieves significant performance improvements. On the Jetson AGX
Orin, an embedded device powered by an NVIDIA GPU and ARM Cortex-A78AE v8.2
64-bit CPUs, we observe speedups of 5648.7x for the GSVD calculation and 10.7x
for the SSL module, while on a server configured with an NVIDIA A100 GPU and
AMD EPYC 7352 CPUs we observe speedups of 4245.1x for the GSVD calculation and
17.3x for the entire SSL module. These gains make real-time processing feasible
for large-scale microphone arrays and leave ample capacity for real-time
processing of subsequent machine learning or deep learning tasks.
comment: 6 pages, 2 figures
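The subspace idea behind MUSIC-family localization can be sketched compactly. The code below implements plain eigendecomposition-based MUSIC for illustration; the paper's GSVD-MUSIC is a noise-robust generalization of this, and the variable names here are assumptions.

```python
import numpy as np

def music_spectrum(R, steering, n_sources):
    """Standard MUSIC pseudospectrum (illustrative, not GSVD-MUSIC).

    R: (M, M) Hermitian spatial covariance of the microphone signals.
    steering: (M, K) candidate steering vectors for K scan directions.
    Directions whose steering vectors are orthogonal to the noise subspace
    produce sharp peaks in the returned spectrum.
    """
    w, V = np.linalg.eigh(R)               # eigenvalues ascending
    En = V[:, : R.shape[0] - n_sources]    # noise-subspace eigenvectors
    proj = En.conj().T @ steering          # projection onto noise subspace
    return 1.0 / (np.sum(np.abs(proj) ** 2, axis=0) + 1e-12)
```

The dominant cost is the per-frame matrix decomposition over all frequency bins, which is exactly the part the paper offloads to the GPU.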
♻ ☆ Deployment-friendly Lane-changing Intention Prediction Powered by Brain-inspired Spiking Neural Networks
Accurate and real-time prediction of surrounding vehicles' lane-changing
intentions is a critical challenge in deploying safe and efficient autonomous
driving systems in open-world scenarios. Existing high-performing methods
remain hard to deploy due to their high computational cost, long training
times, and excessive memory requirements. Here, we propose an efficient
lane-changing intention prediction approach based on brain-inspired Spiking
Neural Networks (SNN). By leveraging the event-driven nature of SNN, the
proposed approach enables us to encode the vehicle's states in a more efficient
manner. Comparison experiments conducted on HighD and NGSIM datasets
demonstrate that our method significantly improves training efficiency and
reduces deployment costs while maintaining comparable prediction accuracy.
Particularly, compared to the baseline, our approach reduces training time by
75% and memory usage by 99.9%. These results validate the efficiency and
reliability of our method in lane-changing predictions, highlighting its
potential for safe and efficient autonomous driving systems while offering
significant advantages in deployment, including reduced training time, lower
memory usage, and faster inference.
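The event-driven efficiency claimed above comes from spiking neurons that emit binary events only when their membrane potential crosses a threshold. A minimal leaky integrate-and-fire (LIF) sketch, purely illustrative (the paper's neuron model and state encoding may differ):

```python
import numpy as np

def lif_spikes(inputs, tau=10.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Simulate one leaky integrate-and-fire neuron over an input-current
    sequence and return its binary spike train. Quiet inputs produce no
    events, so downstream computation is only triggered by spikes."""
    v = 0.0
    spikes = np.zeros(len(inputs), dtype=int)
    for t, i_t in enumerate(inputs):
        v += dt / tau * (-v + i_t)   # leaky integration of the input current
        if v >= v_th:
            spikes[t] = 1            # emit an event
            v = v_reset              # hard reset after spiking
    return spikes
```

Because inactive neurons contribute no spikes and hence no multiplications, sparse traffic scenes translate directly into low compute and memory cost.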
♻ ☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding NeurIPS 2024
Complex 3D scene understanding has gained increasing attention, with scene
encoding strategies playing a crucial role in this success. However, the
optimal scene encoding strategies for various scenarios remain unclear,
particularly compared to their image-based counterparts. To address this issue,
we present a comprehensive study that probes various visual encoding models for
3D scene understanding, identifying the strengths and limitations of each model
across different scenarios. Our evaluation spans seven vision foundation
encoders, including image-based, video-based, and 3D foundation models. We
evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual
Grounding, Segmentation, and Registration, each focusing on different aspects
of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates
superior performance, video models excel in object-level tasks, diffusion
models benefit geometric tasks, and language-pretrained models show unexpected
limitations in language-related tasks. These insights challenge some
conventional understandings, provide novel perspectives on leveraging visual
foundation models, and highlight the need for more flexible encoder selection
in future vision-language and scene-understanding tasks. Code:
https://github.com/YunzeMan/Lexicon3D
comment: NeurIPS 2024. Project page: https://yunzeman.github.io/lexicon3d
Github: https://github.com/YunzeMan/Lexicon3D
♻ ☆ DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization
We aim to solve the problem of learning coarse-to-fine skills from
demonstrations (LfD). To scale precision, traditional LfD approaches often rely
on extensive fine-grained demonstrations with external interpolations or
dynamics models with limited generalization capabilities. For memory-efficient
learning and convenient granularity change, we propose a novel diffusion-SSM
based policy (DiSPo) that learns from diverse coarse skills and produces
varying control scales of actions by leveraging a state-space model, Mamba. Our
evaluations show that the adoption of Mamba and the proposed step-scaling method
enable DiSPo to outperform baselines in three coarse-to-fine benchmark tests,
with success rates up to 81% higher. In addition, DiSPo improves inference
efficiency by generating coarse motions in less critical regions. We finally
demonstrate the scalability of actions with simulation and real-world
manipulation tasks.
comment: 12 pages, 10 figures